Bayesian Reinforcement Learning with Gaussian Process Temporal Difference Methods
Authors
Abstract
Reinforcement Learning is a class of problems frequently encountered by both biological and artificial agents. An important algorithmic component of many Reinforcement Learning solution methods is the estimation of state or state-action values of a fixed policy controlling a Markov decision process (MDP), a task known as policy evaluation. We present a novel Bayesian approach to policy evaluation in general state and action spaces, which employs statistical generative models for value functions via Gaussian processes (GPs). The posterior distribution based on a GP-based statistical model provides us with a value-function estimate, as well as a measure of the variance of that estimate, opening the way to a range of possibilities not previously available. We derive exact expressions for the posterior moments of the value GP, which admit both batch and recursive computations. A sequential kernel sparsification method allows us to derive efficient online algorithms for learning good approximations of the posterior moments. By allowing our algorithms to evaluate state-action values, we derive model-free algorithms based on Policy Iteration for improving policies, thus tackling the complete RL problem. A companion paper describes experiments conducted with the algorithms presented here.
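To make the batch form of the posterior computation concrete, below is a minimal sketch of GP-based value estimation under simplifying assumptions: a squared-exponential kernel, a single trajectory, and white observation noise (the deterministic-transition variant of the model); the recursive and sparsified versions the abstract refers to are not shown. The names `rbf_kernel` and `gptd_posterior` are illustrative, not taken from the paper.

```python
# Minimal batch sketch of GP temporal-difference value estimation.
# Assumptions (not from the paper's text): RBF kernel, one trajectory
# x_0..x_T with rewards r_0..r_{T-1}, and white noise sigma^2 I; the
# stochastic-MDP variant would use a correlated noise term instead.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel between two sets of state vectors (rows).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gptd_posterior(states, rewards, gamma=0.95, sigma=0.1, lengthscale=1.0):
    """Return callables for the posterior mean and variance of the value GP."""
    T = len(rewards)                             # states holds T + 1 rows
    K = rbf_kernel(states, states, lengthscale)
    # Temporal-difference operator: (H v)_t = v(x_t) - gamma * v(x_{t+1}).
    H = np.zeros((T, T + 1))
    H[np.arange(T), np.arange(T)] = 1.0
    H[np.arange(T), np.arange(T) + 1] = -gamma
    Q = H @ K @ H.T + (sigma ** 2) * np.eye(T)   # covariance of observed rewards
    alpha = np.linalg.solve(Q, rewards)

    def mean(x_query):
        k_star = rbf_kernel(x_query, states, lengthscale) @ H.T
        return k_star @ alpha

    def variance(x_query):
        k_star = rbf_kernel(x_query, states, lengthscale) @ H.T
        prior = np.diag(rbf_kernel(x_query, x_query, lengthscale))
        return prior - np.einsum('ij,jk,ik->i', k_star, np.linalg.inv(Q), k_star)

    return mean, variance

# Hypothetical usage on a toy 1-D trajectory:
# states = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
# rewards = np.random.randn(10)
# mean_fn, var_fn = gptd_posterior(states, rewards)
```

The variance returned alongside the mean is what distinguishes this Bayesian estimate from a point estimate of the value function; it is the kind of uncertainty information that can, for instance, inform exploration or stopping decisions during policy iteration.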
Similar Resources
Reinforcement learning with kernels and Gaussian processes
Kernel methods have become popular in many sub-fields of machine learning, with the exception of reinforcement learning; they facilitate rich representations and enable machine learning techniques to work in diverse input spaces. We describe a principled approach to the policy evaluation problem of reinforcement learning. We present a temporal-difference (TD) learning algorithm that uses kernel functions. Ou...
Bayesian Policy Gradient and Actor-Critic Algorithms
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters in the direction of the gradient estimate. Since Monte-Carlo methods tend to have high variance, a large num...
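As a point of reference for the Monte-Carlo gradient estimate mentioned above, here is a minimal sketch of a conventional REINFORCE-style update with a linear softmax policy; it is not the Bayesian policy gradient method of the cited paper, and all names (`softmax_policy`, `reinforce_update`) are hypothetical.

```python
# Illustrative Monte-Carlo (REINFORCE-style) policy-gradient update: the kind
# of high-variance estimator that Bayesian policy gradient methods aim to
# improve upon.  Toy linear softmax policy over a discrete action set.
import numpy as np

def softmax_policy(theta, phi):
    # theta: (feature_dim, num_actions); phi: (feature_dim,) state features.
    logits = phi @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(theta, episodes, gamma=0.99, lr=0.01):
    """episodes: list of trajectories, each a list of (phi, action, reward)."""
    grad = np.zeros_like(theta)
    for traj in episodes:
        # Discounted returns-to-go, computed backwards through the trajectory.
        G, returns = 0.0, []
        for _, _, r in reversed(traj):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (phi, a, _), G_t in zip(traj, returns):
            p = softmax_policy(theta, phi)
            one_hot = np.eye(len(p))[a]
            # grad log pi(a|s) for a linear softmax policy.
            grad += np.outer(phi, one_hot - p) * G_t
    return theta + lr * grad / len(episodes)
```

Because each gradient estimate is built from sampled returns, its variance grows with trajectory length and reward noise, which is why a large number of episodes is typically required; reducing this sample cost is the motivation for the Bayesian treatment.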
Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
The Octopus arm is a highly versatile and complex limb. How the Octopus controls such a hyper-redundant arm (not to mention eight of them!) is as yet unknown. Robotic arms based on the same mechanical principles may render present day robotic arms obsolete. In this paper, we tackle this control problem using an online reinforcement learning algorithm, based on a Bayesian approach to policy eval...
Bayesian Multi-Task Reinforcement Learning
We consider the problem of multi-task reinforcement learning where the learner is provided with a set of tasks, for which only a small number of samples can be generated for any given policy. As the number of samples may not be enough to learn an accurate evaluation of the policy, it would be necessary to identify classes of tasks with similar structure and to learn them jointly. We consider th...
Publication date: 2007